{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Extending Buckaroo\n",
    "Buckaroo is built for exploratory data analysis on unknown data.  Data in the wild is incredibly varied and so are the ways of visualizing it. Most table tools are built around allowing a single bespoke customization, with middle of the road defaults. Buckaroo takes a different approach. Buckaroo lets you build many highly specific configurations and then toggle between them quickly.  This makes it easier to build each configuration because you don't have to solve for every possibility.\n",
    "\n",
    "This document walks you through how to add your own analysis to Buckaroo and allow users to toggle it\n",
    "\n",
    "The extension points are\n",
    "* [PluggableAnalysisFramework](https://buckaroo-data.readthedocs.io/en/latest/articles/pluggable.html) Used to add summary stats and column metadata for use by other steps\n",
    "* [Styling](./Styling-Howto.ipynb) control the visual display of the table\n",
    "* PostProcessing used to transform an entire dataframe\n",
    "* AutoCleaning Automate transformations for dropping nulls, removing outliers and other pre-processing steps, cleans the dataframe and generates python code.  Not yet supported in 0.6\n",
    "\n",
    "Each extension point is composable, and can be interactively mixed and matched"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from buckaroo.dataflow.styling_core import StylingAnalysis\n",
    "import polars as pl\n",
    "from polars import functions as F\n",
    "from buckaroo.polars_buckaroo import PolarsBuckarooWidget"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ROWS = 200\n",
    "typed_df = pl.DataFrame({'int_col':np.random.randint(1,50, ROWS), 'float_col': np.random.randint(1,30, ROWS)/.7,\n",
    "                         'timestamp':[\"2020-01-01 01:00Z\", \"2020-01-01 02:00Z\", \"2020-02-28 02:00Z\", \"2020-03-15 02:00Z\", None] * 40,\n",
    "                         \"str_col\": [\"foobar\", \"Realllllly long string\", \"\", None, \"normal\"]* 40})\n",
    "typed_df = typed_df.with_columns(timestamp=pl.col('timestamp').str.to_datetime() )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "pbw = PolarsBuckarooWidget(typed_df)\n",
    "pbw"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "# Using the Pluggable Analysis Framework\n",
    "\n",
    "The PAF allows users to add summary analysis that runs for every dataframe, and exposes created measures to subsequent steps.\n",
    "There are implementations for pandas and polars.  Individual analysis classes cna depend on other calsess that provide measures, the framwork ensures that they are excecuted in the correct order.\n",
    "\n",
    "These measures form the column metadata used by styling, and the summary information used for pinned rows.\n",
    "\n",
    "You can read more here\n",
    "\n",
    "* https://github.com/paddymul/buckaroo/blob/main/tests/unit/polars_analysis_management_test.py\n",
    "* https://github.com/paddymul/buckaroo/blob/main/buckaroo/customizations/polars_analysis.py\n",
    "\n",
    "The following cell adds a 99th quintile measure and displays it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from buckaroo.pluggable_analysis_framework.utils import (json_postfix)\n",
    "import polars.selectors as cs\n",
    "class Quin99Analysis(StylingAnalysis):\n",
    "    select_clauses = [\n",
    "        cs.numeric().quantile(.99).name.map(json_postfix('quin99'))]\n",
    "    \n",
    "    pinned_rows = [{'primary_key_val': 'quin99', 'displayer_args': {'displayer': 'obj' }}]\n",
    "    df_display_name = 'quin99'\n",
    "    data_key = \"empty\"  # the non pinned rows will pull from the empty dataframe\n",
    "\n",
    "sbw = PolarsBuckarooWidget(typed_df)\n",
    "sbw.add_analysis(Quin99Analysis)\n",
    "sbw"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "# Adding a styling analysis\n",
    "The `StylingAnalysis` class is used to control the display of a column based on the column metadata.  \n",
    "\n",
    "\n",
    "Overriding the `config_from_column_metadata(col:str, sd:SingleColumnMetadata) -> ColumnConfig` computes the config for a single column given that column's metadata.\n",
    "\n",
    "This lets you customize based on metadata collected about a column.  This works with the [PluggableAnalysisFramework](https://buckaroo-data.readthedocs.io/en/latest/articles/pluggable.html),  you can specify required fields that are necessary.  Adding requirements like this guarantees that errors are spotted early.\n",
    "\n",
    "The same StylingAnalysis class can generally work for both Polars and Pandas because it only receives a dictionary with simple python values.\n",
    "\n",
    "The following cell defines two StylingAnalysis, one that shows great detail `everything` the other shows shortened versions `Abrev`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "class EverythingStyling(StylingAnalysis):\n",
    "    \"\"\"\n",
    "    This styling shows as much detail as possible\n",
    "    \"\"\"\n",
    "    df_display_name = \"Everything\"\n",
    "    requires_summary = [\"histogram\", \"is_numeric\", \"dtype\", \"_type\"]\n",
    "    pinned_rows = [{'primary_key_val': 'dtype', 'displayer_args': {'displayer': 'obj' }}]\n",
    "\n",
    "    #Styling analysis handles column iteration for us.\n",
    "    @classmethod\n",
    "    def style_column(kls, col:str, column_metadata):\n",
    "        digits = 10\n",
    "        t = column_metadata['_type']\n",
    "        if column_metadata['is_integer']:\n",
    "            disp = {'displayer': 'float', 'min_fraction_digits':0, 'max_fraction_digits':0}\n",
    "        elif column_metadata['is_numeric']:\n",
    "            disp = {'displayer': 'float', 'min_fraction_digits':digits, 'max_fraction_digits':digits}            \n",
    "        elif t == 'temporal':\n",
    "            disp = {'displayer': 'datetimeLocaleString','locale': 'en-US',  'args': {}}\n",
    "        elif t == 'string':\n",
    "            disp = {'displayer': 'string', 'max_length': 100}\n",
    "        else:\n",
    "            disp = {'displayer': 'obj'}\n",
    "        return {'col_name':col, 'displayer_args': disp }\n",
    "\n",
    "class AbrevStyling(StylingAnalysis):\n",
    "    \"\"\"This styling shows shortened versions of columns \"\"\"\n",
    "    requires_summary = [\"histogram\", \"is_numeric\", \"dtype\", \"_type\"]\n",
    "    df_display_name = \"Abrev\"\n",
    "    pinned_rows = []\n",
    "\n",
    "    @classmethod\n",
    "    def style_column(kls, col:str, column_metadata):\n",
    "        digits = 3\n",
    "        t = column_metadata['_type']\n",
    "        if column_metadata['is_integer']:\n",
    "            disp = {'displayer': 'float', 'min_fraction_digits':0, 'max_fraction_digits':0}\n",
    "        elif column_metadata['is_numeric']:\n",
    "            disp = {'displayer': 'float', 'min_fraction_digits':digits, 'max_fraction_digits':digits}\n",
    "        elif column_metadata['dtype'] == pl.Datetime:\n",
    "            disp = {'displayer': 'datetimeLocaleString','locale': 'en-US',  'args': {}}\n",
    "        elif column_metadata['dtype'] == pl.String:\n",
    "            disp = {'displayer': 'string', 'max_length':10}\n",
    "        else:\n",
    "            disp = {'displayer': 'obj'}\n",
    "        return {'col_name':col, 'displayer_args': disp }\n",
    "\n",
    "sbw = PolarsBuckarooWidget(typed_df)\n",
    "sbw.add_analysis(EverythingStyling)\n",
    "sbw.add_analysis(AbrevStyling)\n",
    "sbw"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "Let's look at pinned_rows, they can be modified by setting `pinned_rows` on Buckaroo Instaniation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "# lets add a post processing method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from polars import functions as F\n",
    "from buckaroo.pluggable_analysis_framework.polars_analysis_management import PolarsAnalysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "#typed_df.select(F.all(), pl.col('float_col').lt(5).replace(True, \"foo\").replace(False, None).alias('errored_float'))[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "bw = PolarsBuckarooWidget(typed_df)\n",
    "@bw.add_processing\n",
    "def transpose(df):\n",
    "    return df.transpose()\n",
    "bw"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "Here we decide that any value of `float_col` below `20` is an error.  \n",
    "\n",
    "We add a column of `errored_float` with a value of `\"some_error\"` when `float_col` is less than 20\n",
    "\n",
    "Then we use `extra_column_config` to set a color on `float_col` when `errored_float` is not null"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "\n",
    "class ValueCountPostProcessing(PolarsAnalysis):\n",
    "    @classmethod\n",
    "    def post_process_df(kls, df):\n",
    "        result_df = df.select(\n",
    "            F.all().value_counts().implode().list.gather(pl.arange(0, 10), null_on_oob=True).explode().struct.rename_fields(['val', 'unused_count']).struct.field('val').prefix('val_'),\n",
    "            F.all().value_counts().implode().list.gather(pl.arange(0, 10), null_on_oob=True).explode().struct.field('count').prefix('count_'))\n",
    "        return [result_df, {}]\n",
    "    post_processing_method = \"value_counts\"\n",
    "\n",
    "class ShowErrorsPostProcessing(PolarsAnalysis):\n",
    "    @classmethod\n",
    "    def post_process_df(kls, df):\n",
    "        result_df = df.select(\n",
    "            F.all(),\n",
    "            #pl.col('float_col').lt(20).replace(True, \"some error\").replace(False, None).alias('errored_float'),\n",
    "            pl.col('float_col').lt(20).alias('errored_float'))\n",
    "        extra_column_config = {\n",
    "            'float_col': {'column_config_override': {\n",
    "                'color_map_config': {\n",
    "                    'color_rule': 'color_not_null',\n",
    "                    'conditional_color': 'red',\n",
    "                    'exist_column': 'errored_float'},\n",
    "                'tooltip_config': { 'tooltip_type':'simple', 'val_column': 'errored_float'}}},\n",
    "            'errored_float': {'column_config_override': {'merge_rule': 'hidden'}}}\n",
    "        return (result_df, extra_column_config)\n",
    "    post_processing_method = \"show_errors\"\n",
    "\n",
    "# In this case we are going to extend PolarsBuckarooWidget so we can take this combination with us\n",
    "base_a_klasses = PolarsBuckarooWidget.analysis_klasses.copy()\n",
    "base_a_klasses.extend([ValueCountPostProcessing, \n",
    "                       ShowErrorsPostProcessing])\n",
    "class VCBuckarooWidget(PolarsBuckarooWidget):\n",
    "    analysis_klasses = base_a_klasses\n",
    "vcb = VCBuckarooWidget(typed_df, debug=False)\n",
    "vcb"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "## Where to use PostProcessing\n",
    "Post processing functions are no argument transformations.  I can't think of a lot of generic whole dataframe operations.\n",
    "\n",
    "`ValueCount` and `Transpose` are generic.  `ShowErrors` depends on two specific columns.\n",
    "\n",
    "I expect Post processing to be very useful for small custom apps built on top of Buckaroo.  When you know the columns and you want a strict set of transforms, PostProcessing is a great fit.\n",
    "\n",
    "\n",
    "\n",
    "Post processing is also useful when combined with a preprocessing function to compare DataFrames\n",
    "\n",
    "Here is some pseudo code\n",
    "```python\n",
    "class ComparePost(ColAnalysis):\n",
    "\n",
    "    @classmethod\n",
    "    def post_process_df(kls, df):\n",
    "        df1,df2 = split_columns(\"|\")\n",
    "        compare_df = run_compare(df1, df2)\n",
    "        return [compare_df, {}]\n",
    "    post_processing_method = 'compare'\n",
    "    \n",
    "class CompareWidget(BuckarooWidget):\n",
    "    analysis_klasses = [ComparePost]\n",
    "    \n",
    "def compare(df1, df2):\n",
    "    joined = pd.concat([prefix_columns(df1, 'df1|'), prefix_columns(df2, 'df21|')])\n",
    "    return CompareWidget(joined)\n",
    "\n",
    "#run this by the following command\n",
    "compare(sales_march_2022_df, sales_march_2023_df)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "# Putting it all together\n",
    "\n",
    "You can compose (combine) the PluggableAnalysisFramework, PostProcessing and Styling into a single widget.  And you can manipulate PostProcessing separately from Styling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from buckaroo.customizations.polars_analysis import (\n",
    "    VCAnalysis, BasicAnalysis, PlTyping,\n",
    "    HistogramAnalysis, ComputedDefaultSummaryStats)\n",
    "from buckaroo.customizations.styling import DefaultSummaryStatsStyling, DefaultMainStyling\n",
    "\n",
    "class KitchenSinkWidget(PolarsBuckarooWidget):\n",
    "    #let's be explicit here and show all of the built in analysis klasses\n",
    "    analysis_klasses = [\n",
    "    # The default analysis methods for Polars\n",
    "    VCAnalysis, BasicAnalysis, PlTyping,\n",
    "    HistogramAnalysis, ComputedDefaultSummaryStats,\n",
    "    # default buckaroo styling\n",
    "    DefaultSummaryStatsStyling, DefaultMainStyling,\n",
    "        \n",
    "    # our Quin99 analysis\n",
    "    Quin99Analysis,  # adds a styling method\n",
    "    #our PostProcessing classes\n",
    "    ValueCountPostProcessing, ShowErrorsPostProcessing,\n",
    "    #our styling methods\n",
    "    EverythingStyling, AbrevStyling]\n",
    "ksw = KitchenSinkWidget(typed_df)\n",
    "ksw"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "# Why aren't there click handlers?\n",
    "\n",
    "Buckaroo doesn't allow arbitrary click handlers and this is by design.  When you allow arbitrary click handlers, you then have to manage state.  If you have noticed, every method of extending buckaroo is a pure function.  Managing application state is difficult and the primary source of errors when building GUIs.\n",
    "\n",
    "Buckaroo is designed purely around displaying DataFrames along with the most common operations that are performed on DataFrames.  If you want more traditional app experiences, right now you can use IPYWidgets and integrate buckaroo into it.  Soon I will be releasing the DFViewer (core component that shows the table) for Streamlit and Solara."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "# What about autocleaning and the low code UI\n",
    "\n",
    "Auto cleaning and the low code UI work together for more fine grained editting of data.  The low code UI presents a gui that works on columns and allows functions with arguments.  \n",
    "\n",
    "Auto cleaning works to suggest operations that are then loaded into the low code ui.  Then these operations can be editted or removed.\n",
    "Auto cleaning options can be cycled through to generate different cleanings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
